import warnings
warnings.filterwarnings('ignore')
import pandas as pd
df = pd.read_csv("Caravan.csv")
df = df.iloc[:,1:]
df.head(2)
| | MOSTYPE | MAANTHUI | MGEMOMV | MGEMLEEF | MOSHOOFD | MGODRK | MGODPR | MGODOV | MGODGE | MRELGE | ... | APERSONG | AGEZONG | AWAOREG | ABRAND | AZEILPL | APLEZIER | AFIETS | AINBOED | ABYSTAND | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | 1 | 3 | 2 | 8 | 0 | 5 | 1 | 3 | 7 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | No |
| 1 | 37 | 1 | 2 | 2 | 8 | 1 | 4 | 1 | 4 | 6 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | No |
2 rows × 86 columns
Create an 80/20 split with a random state of 19. This will ensure reproducibility.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint
df['Purchase'] = df['Purchase'].map({'No':0,'Yes':1})
df.head(2)
| | MOSTYPE | MAANTHUI | MGEMOMV | MGEMLEEF | MOSHOOFD | MGODRK | MGODPR | MGODOV | MGODGE | MRELGE | ... | APERSONG | AGEZONG | AWAOREG | ABRAND | AZEILPL | APLEZIER | AFIETS | AINBOED | ABYSTAND | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | 1 | 3 | 2 | 8 | 0 | 5 | 1 | 3 | 7 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 37 | 1 | 2 | 2 | 8 | 1 | 4 | 1 | 4 | 6 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 rows × 86 columns
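Before modeling, it is worth checking how imbalanced the target is: accuracy alone can look deceptively high when purchases are rare. A minimal sketch of the check (on a tiny synthetic series; in the notebook you would call `value_counts` on `df['Purchase']`):

```python
import pandas as pd

# Class-balance check: the Caravan target is heavily imbalanced, so a plain
# accuracy score can be misleading. (Synthetic stand-in for df['Purchase'].)
purchase = pd.Series([0] * 9 + [1])
counts = purchase.value_counts(normalize=True)
print(counts)  # fraction of each class
```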
# Split the data into features (X) and target (y)
X = df.drop('Purchase', axis=1)
y = df['Purchase']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=19)
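With a rare positive class, passing `stratify=y` to `train_test_split` keeps the class proportions identical in the train and test folds. A small sketch on synthetic data (`X_demo`/`y_demo` are stand-ins, not the Caravan frame):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stratify=y preserves the 10% positive rate in both folds exactly.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 10 + [0] * 90)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=19, stratify=y_demo)
print(y_tr.mean(), y_te.mean())  # both 0.1
```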
Fit a boosting model to the training data with Purchase as the outcome variable and the remaining variables as predictors. Use 1000 trees and a learning rate of 0.01. Which predictors appear to be the most important? (Hint: use the GradientBoostingClassifier class and its feature_importances_ attribute.)
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier
gradient_booster = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.01, random_state = 19)
gradient_booster.fit(X_train,y_train)
GradientBoostingClassifier(learning_rate=0.01, n_estimators=1000, random_state=19)
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error
feat_importances = pd.Series(gradient_booster.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
<AxesSubplot:>
The top 3 predictors are PPERSAUT, PBRAND and PPLEZIER.
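The `permutation_importance` import above is never actually used, but it offers a useful cross-check: impurity-based `feature_importances_` can be biased toward features with many distinct values, whereas permutation importance measures the score drop when each feature is shuffled. A sketch on synthetic data (in the notebook you would pass `gradient_booster`, `X_test`, `y_test` instead):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Only feature 0 is informative here, so its permutation importance should
# dominate. (Synthetic data; replace with the fitted model and test split.)
rng = np.random.default_rng(19)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = GradientBoostingClassifier(n_estimators=50, random_state=19).fit(X_demo, y_demo)
result = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=19)
print(result.importances_mean)
```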
Use the boosting model to predict the outcome variable on the test set. Predict that a person will make purchase if the estimated probability of purchase is greater than 25%. Create a confusion matrix.
y_predict_prob = gradient_booster.predict_proba(X_test)
y_predict_prob
array([[0.97859171, 0.02140829],
[0.82421339, 0.17578661],
[0.9143759 , 0.0856241 ],
...,
[0.98082219, 0.01917781],
[0.85329489, 0.14670511],
[0.98599035, 0.01400965]])
y_predict_prob_class_1 = y_predict_prob[:,1] #This is the prob of purchase
y_predict_prob_class_1
array([0.02140829, 0.17578661, 0.0856241 , ..., 0.01917781, 0.14670511,
0.01400965])
y_predict_class = [1 if prob > 0.25 else 0 for prob in y_predict_prob_class_1]
confusion_matrix(y_test, y_predict_class)
array([[1068, 31],
[ 57, 9]])
boosting_score = accuracy_score(y_test, y_predict_class)
boosting_score
0.9244635193133047
What fraction of the people predicted to make a purchase do in fact make a purchase?
9/(31+9)
0.225
22.5% of the people predicted to make a purchase do in fact make one; in other words, the model's precision at this threshold is 0.225.
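The fraction computed by hand above is exactly the classifier's precision at the 0.25 threshold, which `precision_score` (already imported at the top) gives directly. A sketch with toy labels (in the notebook use `y_test` and `y_predict_class`):

```python
from sklearn.metrics import precision_score

# precision = TP / (TP + FP): here 1 true positive out of 3 positive predictions.
y_true = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1]
prec = precision_score(y_true, y_pred)
print(prec)  # 1/3
```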
How does this result compare with results if you apply KNN, logistic regression, and Random Forest? Include your results in a table.
#KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
y_predict_prob_knn = knn.predict_proba(X_test)
y_predict_prob_class_1_knn = y_predict_prob_knn[:,1]
y_predict_class_knn = [1 if prob > 0.25 else 0 for prob in y_predict_prob_class_1_knn]
knn_score = accuracy_score(y_test, y_predict_class_knn)
knn_score
0.8755364806866953
#logistic regression
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 19)
classifier.fit(X_train, y_train)
y_predict_prob_lg = classifier.predict_proba(X_test)
y_predict_prob_class_1_lg = y_predict_prob_lg[:,1]
y_predict_class_lg = [1 if prob > 0.25 else 0 for prob in y_predict_prob_class_1_lg]
lg_score = accuracy_score(y_test, y_predict_class_lg)
lg_score
0.927038626609442
#random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state = 19)
rf.fit(X_train, y_train)
y_predict_prob_rf = rf.predict_proba(X_test)
y_predict_prob_class_1_rf = y_predict_prob_rf[:,1]
y_predict_class_rf = [1 if prob > 0.25 else 0 for prob in y_predict_prob_class_1_rf]
rf_score = accuracy_score(y_test, y_predict_class_rf)
rf_score
0.9021459227467811
data = {'Models': ['Boosting', 'KNN', 'Logistic Regression', 'Random Forest'],
'Accuracy': [boosting_score, knn_score, lg_score, rf_score]
}
table = pd.DataFrame(data)
print(table.sort_values("Accuracy"))
                Models  Accuracy
1                  KNN  0.875536
3        Random Forest  0.902146
0             Boosting  0.924464
2  Logistic Regression  0.927039
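The same 0.25-threshold rule is applied four times above; it can be factored into a small helper (the function name is my own) so each model only needs one call:

```python
import numpy as np

def classify_at_threshold(prob_class_1, threshold=0.25):
    """Return 1 where the class-1 probability strictly exceeds the threshold."""
    return (np.asarray(prob_class_1) > threshold).astype(int)

# Note the strict inequality: a probability exactly at the threshold maps to 0.
print(classify_at_threshold([0.1, 0.3, 0.25]))  # [0 1 0]
```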
df2 = pd.read_csv("OJ.csv")
df2.head()
| | Unnamed: 0 | Purchase | WeekofPurchase | StoreID | PriceCH | PriceMM | DiscCH | DiscMM | SpecialCH | SpecialMM | LoyalCH | SalePriceMM | SalePriceCH | PriceDiff | Store7 | PctDiscMM | PctDiscCH | ListPriceDiff | STORE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CH | 237 | 1 | 1.75 | 1.99 | 0.00 | 0.0 | 0 | 0 | 0.500000 | 1.99 | 1.75 | 0.24 | No | 0.000000 | 0.000000 | 0.24 | 1 |
| 1 | 2 | CH | 239 | 1 | 1.75 | 1.99 | 0.00 | 0.3 | 0 | 1 | 0.600000 | 1.69 | 1.75 | -0.06 | No | 0.150754 | 0.000000 | 0.24 | 1 |
| 2 | 3 | CH | 245 | 1 | 1.86 | 2.09 | 0.17 | 0.0 | 0 | 0 | 0.680000 | 2.09 | 1.69 | 0.40 | No | 0.000000 | 0.091398 | 0.23 | 1 |
| 3 | 4 | MM | 227 | 1 | 1.69 | 1.69 | 0.00 | 0.0 | 0 | 0 | 0.400000 | 1.69 | 1.69 | 0.00 | No | 0.000000 | 0.000000 | 0.00 | 1 |
| 4 | 5 | CH | 228 | 7 | 1.69 | 1.69 | 0.00 | 0.0 | 0 | 0 | 0.956535 | 1.69 | 1.69 | 0.00 | Yes | 0.000000 | 0.000000 | 0.00 | 0 |
df2 = df2.iloc[:,1:]
df2['Purchase'] = df2['Purchase'].map({'CH':0,'MM':1})
df2['Store7'] = df2['Store7'].map({'No':0,'Yes':1})
df2.head()
| | Purchase | WeekofPurchase | StoreID | PriceCH | PriceMM | DiscCH | DiscMM | SpecialCH | SpecialMM | LoyalCH | SalePriceMM | SalePriceCH | PriceDiff | Store7 | PctDiscMM | PctDiscCH | ListPriceDiff | STORE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 237 | 1 | 1.75 | 1.99 | 0.00 | 0.0 | 0 | 0 | 0.500000 | 1.99 | 1.75 | 0.24 | 0 | 0.000000 | 0.000000 | 0.24 | 1 |
| 1 | 0 | 239 | 1 | 1.75 | 1.99 | 0.00 | 0.3 | 0 | 1 | 0.600000 | 1.69 | 1.75 | -0.06 | 0 | 0.150754 | 0.000000 | 0.24 | 1 |
| 2 | 0 | 245 | 1 | 1.86 | 2.09 | 0.17 | 0.0 | 0 | 0 | 0.680000 | 2.09 | 1.69 | 0.40 | 0 | 0.000000 | 0.091398 | 0.23 | 1 |
| 3 | 1 | 227 | 1 | 1.69 | 1.69 | 0.00 | 0.0 | 0 | 0 | 0.400000 | 1.69 | 1.69 | 0.00 | 0 | 0.000000 | 0.000000 | 0.00 | 1 |
| 4 | 0 | 228 | 7 | 1.69 | 1.69 | 0.00 | 0.0 | 0 | 0 | 0.956535 | 1.69 | 1.69 | 0.00 | 1 | 0.000000 | 0.000000 | 0.00 | 0 |
Create a training set containing 70% of the observations and a test set containing the remaining 30%.
# Split the data into features (X) and target (y)
X = df2.drop('Purchase', axis=1)
y = df2['Purchase']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Fit a tree to the training data, with Purchase as the response and the other variables as the predictors. Use the get_params() function to produce summary statistics about the tree, and describe the results obtained.
What is the training error rate? How many terminal nodes does the tree have?
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state = 42)
clf.fit(X_train, y_train)
clf.get_params()
{'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': 42,
'splitter': 'best'}
predictions = clf.predict(X_train)
from sklearn.metrics import accuracy_score
training_error_rate = 1- accuracy_score(y_train, predictions)
print(training_error_rate) #training error rate
0.008010680907877155
clf.get_n_leaves() #There are 137 leaves/terminal nodes in the tree
137
Plot the tree and interpret the results. (Hint: use the dtreeviz package in Python)
#!pip install graphviz
import graphviz
#!pip install dtreeviz
import dtreeviz
viz_model = dtreeviz.model(clf,
X_train, y_train,
feature_names=X_train.columns.tolist(),
target_name='Purchase')
viz_model.view()
Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
predictions = clf.predict(X_test)
test_error_rate = 1 - accuracy_score(y_test, predictions)
print(test_error_rate) #test error rate
0.2928348909657321
confusion_matrix(y_test, predictions)
array([[148, 45],
[ 49, 79]])
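`ConfusionMatrixDisplay`, imported at the top of the notebook but never used, renders the same matrix as a labeled heatmap, which is easier to read than the raw array. A minimal sketch with toy labels (the CH/MM names mirror the OJ `Purchase` coding of 0/1):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for this sketch
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Build the matrix from toy labels, then draw it with class names attached.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["CH", "MM"])
disp.plot()
plt.savefig("cm_demo.png")
```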
Apply the GridSearchCV to the training set in order to determine the optimal tree size. Use the following parameter grid:
tree_param = {'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50],
'min_samples_split': [2,3,4,5,6], 'min_samples_leaf' : [2,3,4,5,6]}
from sklearn.model_selection import GridSearchCV
tree_param = {'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50],
'min_samples_split': [2,3,4,5,6], 'min_samples_leaf' : [2,3,4,5,6]}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), tree_param, verbose=3, cv=5)
import contextlib
with contextlib.redirect_stdout(None):  # suppress the verbose fold-by-fold output
    grid_search_cv.fit(X_train, y_train)
clf = grid_search_cv.best_estimator_
grid_search_cv.best_params_
{'criterion': 'gini',
'max_depth': 5,
'min_samples_leaf': 3,
'min_samples_split': 2}
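`best_params_` only reports which settings won; `best_score_` gives the mean cross-validated accuracy those settings achieved, which is worth checking before trusting the winner. A sketch on synthetic data with a much smaller grid than the one above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Tiny demo grid: best_score_ is the mean CV accuracy of the best candidate.
X_demo, y_demo = make_classification(n_samples=200, random_state=42)
gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  {"max_depth": [2, 4]}, cv=3)
gs.fit(X_demo, y_demo)
print(gs.best_params_, round(gs.best_score_, 3))
```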
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5, min_samples_leaf=3, random_state=42)
clf.tree_.node_count
#optimal tree size is 51
51
predictions = clf.predict(X_test)
test_error_rate = 1 - accuracy_score(y_test, predictions)
print(test_error_rate) #test error rate
0.2461059190031153
predictions_train = clf.predict(X_train)
train_error_rate = 1 - accuracy_score(y_train, predictions_train)
print(train_error_rate) #train error rate
0.12416555407209617
Plot the tree size on the x-axis and cross-validated classification error rate on the y-axis. Which tree size corresponds to the lowest cross-validated classification error rate?
tree_cv = pd.DataFrame(grid_search_cv.cv_results_)
tree_cv['tree_size'] = ''
tree_cv['error_rate'] = ''
tree_cv.head()
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | param_min_samples_leaf | param_min_samples_split | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | tree_size | error_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.012579 | 0.011842 | 0.015275 | 0.021898 | gini | 4 | 2 | 2 | {'criterion': 'gini', 'max_depth': 4, 'min_sam... | 0.793333 | 0.8 | 0.84 | 0.813333 | 0.845638 | 0.818461 | 0.020981 | 44 | ||
| 1 | 0.009833 | 0.001676 | 0.007060 | 0.002132 | gini | 4 | 2 | 3 | {'criterion': 'gini', 'max_depth': 4, 'min_sam... | 0.793333 | 0.8 | 0.84 | 0.813333 | 0.845638 | 0.818461 | 0.020981 | 44 | ||
| 2 | 0.013317 | 0.007152 | 0.006985 | 0.005544 | gini | 4 | 2 | 4 | {'criterion': 'gini', 'max_depth': 4, 'min_sam... | 0.793333 | 0.8 | 0.84 | 0.813333 | 0.845638 | 0.818461 | 0.020981 | 44 | ||
| 3 | 0.008314 | 0.002541 | 0.006488 | 0.005324 | gini | 4 | 2 | 5 | {'criterion': 'gini', 'max_depth': 4, 'min_sam... | 0.793333 | 0.8 | 0.84 | 0.813333 | 0.845638 | 0.818461 | 0.020981 | 44 | ||
| 4 | 0.009239 | 0.001614 | 0.005794 | 0.000798 | gini | 4 | 2 | 6 | {'criterion': 'gini', 'max_depth': 4, 'min_sam... | 0.793333 | 0.8 | 0.84 | 0.813333 | 0.845638 | 0.818461 | 0.020981 | 44 |
for i in range(len(tree_cv)):
    # Refit each parameter combination to record its tree size and test error
    clf = DecisionTreeClassifier(criterion=tree_cv.loc[i, 'param_criterion'],
                                 max_depth=tree_cv.loc[i, 'param_max_depth'],
                                 min_samples_leaf=tree_cv.loc[i, 'param_min_samples_leaf'],
                                 min_samples_split=tree_cv.loc[i, 'param_min_samples_split'],
                                 random_state=42)
    clf.fit(X_train, y_train)
    tree_cv.loc[i, 'tree_size'] = clf.tree_.node_count
    predictions = clf.predict(X_test)
    tree_cv.loc[i, 'error_rate'] = 1 - accuracy_score(y_test, predictions)
#To draw the graph, we need to group by tree size. The minimum classification error will be used
tree_cv_grouped = tree_cv.groupby(['tree_size']).agg({'error_rate':'min'})
tree_cv_grouped = tree_cv_grouped.reset_index()
tree_cv_grouped
| tree_size | error_rate | |
|---|---|---|
| 0 | 25 | 0.255452 |
| 1 | 27 | 0.199377 |
| 2 | 29 | 0.190031 |
| 3 | 43 | 0.205607 |
| 4 | 45 | 0.196262 |
| ... | ... | ... |
| 65 | 217 | 0.255452 |
| 66 | 219 | 0.277259 |
| 67 | 221 | 0.277259 |
| 68 | 231 | 0.242991 |
| 69 | 233 | 0.242991 |
70 rows × 2 columns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.grid()
plt.plot(tree_cv_grouped['tree_size'], tree_cv_grouped['error_rate'])
plt.xlabel("Tree Size")
plt.ylabel("Classification Error Rate")
Text(0, 0.5, 'Classification Error Rate')
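Rather than eyeballing the curve, `idxmin` identifies the exact minimizer. A sketch on a toy frame standing in for `tree_cv_grouped` (values borrowed from the first rows of the grouped table above):

```python
import pandas as pd

# Pick the row with the lowest error rate instead of reading it off the plot.
demo = pd.DataFrame({"tree_size": [25, 27, 29, 43],
                     "error_rate": [0.255452, 0.199377, 0.190031, 0.205607]})
best = demo.loc[demo["error_rate"].idxmin()]
print(int(best["tree_size"]))  # 29
```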
According to this graph (and the grouped table above, whose minimum error of 0.190031 occurs at 29 nodes), the lowest classification error rate corresponds to a tree size of about 29.
Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to the selection of a pruned tree, create a pruned tree with five terminal nodes.
clf = grid_search_cv.best_estimator_
path = clf.cost_complexity_pruning_path(X_train, y_train)
path
{'ccp_alphas': array([0. , 0.00037887, 0.00057755, 0.00093458, 0.0012016 ,
0.00144637, 0.00233038, 0.00238924, 0.00247578, 0.00304979,
0.00330043, 0.00379766, 0.00406541, 0.00480764, 0.00505792,
0.00516478, 0.00530909, 0.01439439, 0.02019872, 0.02223703,
0.17499343]),
'impurities': array([0.17740075, 0.17777962, 0.17835717, 0.17929175, 0.18049335,
0.1833861 , 0.18571648, 0.18810572, 0.19058149, 0.19668108,
0.19998151, 0.20377916, 0.21190999, 0.22152526, 0.23164111,
0.23680589, 0.24211498, 0.25650937, 0.27670809, 0.29894511,
0.47393855])}
ccp_alphas, impurities = path.ccp_alphas, path.impurities
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
clf.fit(X_train, y_train)
clfs.append(clf)
from sklearn.metrics import accuracy_score
acc_scores = [accuracy_score(y_test, clf.predict(X_test)) for clf in clfs]
tree_depths = [clf.tree_.max_depth for clf in clfs]
plt.figure(figsize=(10, 6))
plt.grid()
plt.plot(ccp_alphas[:-1], acc_scores[:-1])
plt.xlabel("effective alpha")
plt.ylabel("Accuracy scores")
Text(0, 0.5, 'Accuracy scores')
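The alpha could also be chosen programmatically, as the `ccp_alpha` whose pruned tree scores highest on the held-out set. A sketch with synthetic stand-ins for `ccp_alphas` and `acc_scores`:

```python
import numpy as np

# argmax over the accuracy curve picks the best-scoring alpha directly.
# (demo_alphas/demo_scores are made up; substitute ccp_alphas/acc_scores.)
demo_alphas = np.array([0.0, 0.005, 0.010, 0.015])
demo_scores = np.array([0.75, 0.79, 0.80, 0.78])
best_alpha = demo_alphas[int(np.argmax(demo_scores))]
print(best_alpha)  # 0.01
```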
According to this graph, the highest accuracy occurs for alphas between roughly 0.005 and 0.015. I will use an alpha of 0.01, which lies in the middle of that range.
tree = DecisionTreeClassifier(criterion= 'gini', max_depth=5, min_samples_leaf=3,
min_samples_split = 2, random_state=42, ccp_alpha = 0.01)
tree.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.01, max_depth=5, min_samples_leaf=3, random_state=42)
predictions = tree.predict(X_test)
test_error_rate = 1 - accuracy_score(y_test, predictions)
print(test_error_rate)
0.19937694704049846
predictions_train = tree.predict(X_train)
train_error_rate = 1 - accuracy_score(y_train, predictions_train)
print(train_error_rate)
0.157543391188251
Compare the training error rates and test error rates between the pruned and unpruned trees. Which is higher?
For the test error rate, the unpruned tree scored 0.2461059190031153 and the pruned tree scored 0.19937694704049846. The unpruned tree's error rate was higher, so pruning improved accuracy on the test data.
For the training error rate, the unpruned tree scored 0.12416555407209617 and the pruned tree scored 0.157543391188251. The pruned tree's error rate was higher, so pruning did not improve accuracy on the training data.
This shows that the pruned tree overfits the training data less, which is why it achieves higher accuracy on the test set.